OCR exploration server

Warning: this site is under development!
Warning: this site is generated automatically from raw corpora.
The information is therefore not validated.

Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens

Internal identifier: 000D49 (Main/Exploration); previous: 000D48; next: 000D50


Authors: Stoyan Mihov; Petar Mitankin; Annette Gotscharek [Germany]; Ulrich Reffle [Germany]; U. Schulz [Germany]; Christoph Ringlstetter

Source:

RBID : ISTEX:EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5

Abstract

Lexical text correction systems are typically based on a central step: when finding a malformed token in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for the selection of correction candidates which is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the meaningful selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high recall.
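To illustrate the kind of candidate selection the abstract describes, the sketch below implements a simple profile-guided variant: substitutions listed in the error profile count as cheap edits, so dictionary words reachable through characteristic OCR confusions rank close to the garbled token. This is a minimal illustration, not the authors' implementation; the dictionary, the profile pairs and the token are invented, and merges and splits are omitted for brevity.

```python
def edit_distance(a, b, cheap_subs):
    """Levenshtein distance in which substitutions listed in the error
    profile cost 0.5 and all other edit operations cost 1."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif (a[i - 1], b[j - 1]) in cheap_subs:
                sub = 0.5  # characteristic confusion from the error profile
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + sub)  # (cheap) substitution
    return d[m][n]

def candidates(token, dictionary, profile, bound=1.0):
    """Return all dictionary words within the edit bound of the token."""
    return sorted(w for w in dictionary
                  if edit_distance(token, w, profile) <= bound)

# Illustrative data: the profile pairs stand in for confusions that
# error-dictionary profiling would extract from the input text itself.
dictionary = {"modern", "morsel", "modem", "mode"}
profile = {("n", "m"), ("c", "e")}
print(candidates("moden", dictionary, profile))  # → ['mode', 'modem', 'modern']
```

In a real system the dictionary would be held in a finite-state automaton and traversed with such a profile-weighted distance, so that only a small candidate set is ever materialised; the brute-force loop here is only for readability.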

DOI: 10.1007/978-3-540-76928-6_47


The document in XML format

<record>
<TEI wicri:istexFullTextTei="biblStruct:series">
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens</title>
<author>
<name sortKey="Mihov, Stoyan" sort="Mihov, Stoyan" uniqKey="Mihov S" first="Stoyan" last="Mihov">Stoyan Mihov</name>
</author>
<author>
<name sortKey="Mitankin, Petar" sort="Mitankin, Petar" uniqKey="Mitankin P" first="Petar" last="Mitankin">Petar Mitankin</name>
</author>
<author>
<name sortKey="Gotscharek, Annette" sort="Gotscharek, Annette" uniqKey="Gotscharek A" first="Annette" last="Gotscharek">Annette Gotscharek</name>
</author>
<author>
<name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
</author>
<author>
<name sortKey="Schulz, U" sort="Schulz, U" uniqKey="Schulz U" first="U." last="Schulz">U. Schulz</name>
</author>
<author>
<name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-76928-6_47</idno>
<idno type="url">https://api.istex.fr/document/EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000B68</idno>
<idno type="wicri:Area/Istex/Curation">000B53</idno>
<idno type="wicri:Area/Istex/Checkpoint">000764</idno>
<idno type="wicri:doubleKey">0302-9743:2007:Mihov S:using:automated:error</idno>
<idno type="wicri:Area/Main/Merge">000D62</idno>
<idno type="wicri:Area/Main/Curation">000D49</idno>
<idno type="wicri:Area/Main/Exploration">000D49</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title level="a" type="main" xml:lang="en">Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens</title>
<author>
<name sortKey="Mihov, Stoyan" sort="Mihov, Stoyan" uniqKey="Mihov S" first="Stoyan" last="Mihov">Stoyan Mihov</name>
<affiliation>
<wicri:noCountry code="subField">Sciences</wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Mitankin, Petar" sort="Mitankin, Petar" uniqKey="Mitankin P" first="Petar" last="Mitankin">Petar Mitankin</name>
<affiliation>
<wicri:noCountry code="subField">Sciences</wicri:noCountry>
</affiliation>
</author>
<author>
<name sortKey="Gotscharek, Annette" sort="Gotscharek, Annette" uniqKey="Gotscharek A" first="Annette" last="Gotscharek">Annette Gotscharek</name>
<affiliation wicri:level="4">
<country>Allemagne</country>
<placeName>
<settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author>
<name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
<affiliation wicri:level="4">
<country>Allemagne</country>
<placeName>
<settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author>
<name sortKey="Schulz, U" sort="Schulz, U" uniqKey="Schulz U" first="U." last="Schulz">U. Schulz</name>
<affiliation wicri:level="4">
<country>Allemagne</country>
<placeName>
<settlement type="city">Munich</settlement>
<region type="land" nuts="1">Bavière</region>
<region type="district" nuts="2">District de Haute-Bavière</region>
</placeName>
<orgName type="university">Université Louis-et-Maximilien de Munich</orgName>
</affiliation>
</author>
<author>
<name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
<affiliation>
<wicri:noCountry code="subField">Alberta</wicri:noCountry>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series>
<title level="s">Lecture Notes in Computer Science</title>
<imprint>
<date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
</series>
<idno type="istex">EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5</idno>
<idno type="DOI">10.1007/978-3-540-76928-6_47</idno>
<idno type="ChapterID">47</idno>
<idno type="ChapterID">Chap47</idno>
</biblStruct>
</sourceDesc>
<seriesStmt>
<idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass></textClass>
<langUsage>
<language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Abstract: Lexical text correction systems are typically based on a central step: when finding a malformed token in the input text, a set of correction candidates for the token is retrieved from the given background dictionary. In previous work we introduced a method for the selection of correction candidates which is fast and leads to small candidate sets with high recall. As a prerequisite, ground truth data were used to find a set of important substitutions, merges and splits that represent characteristic errors found in the text. This prior knowledge was then used to fine-tune the meaningful selection of correction candidates. Here we show that an appropriate set of possible substitutions, merges and splits for the input text can be retrieved without any ground truth data. In the new approach, we compute an error profile of the erroneous input text in a fully automated way, using so-called error dictionaries. From this profile, suitable sets of substitutions, merges and splits are derived. Error profiling with error dictionaries is simple and very fast. As an overall result we obtain an adaptive form of candidate selection which is very efficient, does not need ground truth data and leads to small candidate sets with high recall.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Allemagne</li>
</country>
<region>
<li>Bavière</li>
<li>District de Haute-Bavière</li>
</region>
<settlement>
<li>Munich</li>
</settlement>
<orgName>
<li>Université Louis-et-Maximilien de Munich</li>
</orgName>
</list>
<tree>
<noCountry>
<name sortKey="Mihov, Stoyan" sort="Mihov, Stoyan" uniqKey="Mihov S" first="Stoyan" last="Mihov">Stoyan Mihov</name>
<name sortKey="Mitankin, Petar" sort="Mitankin, Petar" uniqKey="Mitankin P" first="Petar" last="Mitankin">Petar Mitankin</name>
<name sortKey="Ringlstetter, Christoph" sort="Ringlstetter, Christoph" uniqKey="Ringlstetter C" first="Christoph" last="Ringlstetter">Christoph Ringlstetter</name>
</noCountry>
<country name="Allemagne">
<region name="Bavière">
<name sortKey="Gotscharek, Annette" sort="Gotscharek, Annette" uniqKey="Gotscharek A" first="Annette" last="Gotscharek">Annette Gotscharek</name>
</region>
<name sortKey="Reffle, Ulrich" sort="Reffle, Ulrich" uniqKey="Reffle U" first="Ulrich" last="Reffle">Ulrich Reffle</name>
<name sortKey="Schulz, U" sort="Schulz, U" uniqKey="Schulz U" first="U." last="Schulz">U. Schulz</name>
</country>
</tree>
</affiliations>
</record>

Manipulating this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000D49 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000D49 | SxmlIndent | more

To put a link to this page in the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:EC49BC3BD280658ED40AD1041F0C4EEFD8C8CDE5
   |texte=   Using Automated Error Profiling of Texts for Improved Selection of Correction Candidates for Garbled Tokens
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024